Clustering high-throughput sequencing data with Poisson mixture models

Authors

  • Andrea Rau
  • Gilles Celeux
  • Marie-Laure Martin-Magniette
  • Cathy Maugis-Rabusseau
Abstract

In recent years, gene expression studies have increasingly made use of next-generation sequencing technology. In turn, research on appropriate statistical methods for the analysis of digital gene expression has flourished, primarily in the context of normalization and differential analysis. In this work, we focus on the question of clustering digital gene expression profiles as a means to discover groups of co-expressed genes. We propose two parameterizations of a Poisson mixture model to cluster expression profiles of high-throughput sequencing data. A set of simulation studies compares the performance of the proposed models with that of an approach developed for a similar type of data, namely serial analysis of gene expression (SAGE). We also study the performance of these approaches on two real high-throughput sequencing data sets. The R package HTSCluster used to implement the proposed Poisson mixture models is available on CRAN.

Key-words: Mixture models, clustering, co-expression, RNA-seq, EM-type algorithms

Affiliations

  • INRA, UMR 1313 GABI, Jouy-en-Josas, France
  • INRIA Saclay Île-de-France, Orsay, France
  • UMR INRA 1165 UEVE, ERL CNRS 8196, Unité de Recherche en Génomique Végétale, Evry, France
  • UMR AgroParisTech/INRA MIA 518, Paris, France
  • Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse

French title: Classification de données de séquençage à haut-débit avec les modèles de mélange de Poisson
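To give a concrete sense of what clustering count profiles with a Poisson mixture and an EM-type algorithm can look like, here is a minimal sketch in Python. It is not the HTSCluster implementation and does not claim to reproduce either of the two parameterizations proposed by the authors. It assumes, purely for illustration, that the count of gene i in sample j, given cluster k, is Poisson with mean w_i * lam[k][j], where w_i is the gene's total count and each cluster profile lam[k] sums to one. The function name, the offset w_i, and the naive deterministic initialisation are all assumptions of this sketch.

```python
import math


def fit_poisson_mixture(counts, n_clusters, n_iter=50):
    """Illustrative EM for a Poisson mixture with mean w_i * lam[k][j].

    counts: list of per-gene lists of non-negative integer counts.
    Returns (labels, pi, lam). Hypothetical helper, not a library API.
    """
    n, d = len(counts), len(counts[0])
    w = [sum(row) for row in counts]  # assumed per-gene offset: total count

    # Naive initialisation (assumption): seed each cluster profile with the
    # smoothed, normalised profile of an evenly spaced gene.
    lam = []
    for k in range(n_clusters):
        row = counts[(k * n) // n_clusters]
        total = sum(row) + 0.5 * d
        lam.append([(y + 0.5) / total for y in row])
    pi = [1.0 / n_clusters] * n_clusters
    resp = [[0.0] * n_clusters for _ in range(n)]

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities. Terms that do not depend
        # on k (log y!, y*log w_i, and w_i itself) cancel and are omitted.
        for i in range(n):
            logp = [
                math.log(pi[k])
                + sum(counts[i][j] * math.log(lam[k][j]) for j in range(d))
                for k in range(n_clusters)
            ]
            m = max(logp)
            ex = [math.exp(v - m) for v in logp]
            s = sum(ex)
            resp[i] = [v / s for v in ex]

        # M-step: update mixing proportions and cluster profiles.
        for k in range(n_clusters):
            nk = sum(resp[i][k] for i in range(n))
            pi[k] = max(nk / n, 1e-10)
            denom = sum(resp[i][k] * w[i] for i in range(n))
            for j in range(d):
                if denom > 0:
                    num = sum(resp[i][k] * counts[i][j] for i in range(n))
                    lam[k][j] = max(num / denom, 1e-10)
                else:
                    lam[k][j] = 1.0 / d  # empty cluster: flat profile

    labels = [max(range(n_clusters), key=lambda k: resp[i][k]) for i in range(n)]
    return labels, pi, lam


# Toy data: three genes expressed mostly in sample 1, three mostly in sample 2.
toy = [[90, 10], [85, 15], [95, 5], [10, 90], [12, 88], [8, 92]]
labels, pi, lam = fit_poisson_mixture(toy, n_clusters=2)
print(labels)
```

In practice, mixture fits of this kind are sensitive to initialisation and to the choice of the number of clusters, so several restarts and a model selection criterion would normally be needed; the sketch leaves both out.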

Related resources

A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data

High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. Analyzing the network itself or the interplay between genes in this type of data continues to present many challenges. As data visualization techniques become cumbersome for higher dimensions and unconvincing when there is no clear separation between homogeneous subgroups within th...

Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models

Motivation: In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis. Results: In this work, we focus on the question of clustering DGE profiles ...

Gene expression Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models

Motivation: In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis. Results: In this work, we focus on the question of clustering DGE profiles ...

Bayesian Mixture Models for Gene Expression and Protein Profiles

We review the use of semi-parametric mixture models for Bayesian inference in high throughput genomic data. We discuss three specific approaches for microarray data, for protein mass spectrometry experiments, and for SAGE data. For the microarray data and the protein mass spectrometry we assume group comparison experiments, i.e., experiments that seek to identify genes and proteins that are dif...

Two-way Poisson mixture models for simultaneous document classification and word clustering

An approach to simultaneous document classification and word clustering is developed using a two-way mixture model of Poisson distributions. Each document is represented by a vector with each dimension specifying the number of occurrences of a particular word in the document in question. As a collection of documents across several classes usually makes use of a large number of words, the docume...

Publication date: 2011